Skip to content

BUG: resampling with NaT in TimedeltaIndex (#13223) #14649

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 354 commits into from

Conversation

paulgliu
Copy link

@codecov-io
Copy link

Current coverage is 85.28% (diff: 100%)

Merging #14649 into master will increase coverage by <.01%

@@             master     #14649   diff @@
==========================================
  Files           140        140          
  Lines         50693      50698     +5   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43234      43240     +6   
+ Misses         7459       7458     -1   
  Partials          0          0          

Powered by Codecov. Last update 1d6dbb4...d112e73

@codecov-io
Copy link

codecov-io commented Nov 14, 2016

Codecov Report

Merging #14649 into master will increase coverage by 4.64%.
The diff coverage is 96.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #14649      +/-   ##
==========================================
+ Coverage   86.33%   90.97%   +4.64%     
==========================================
  Files         139      145       +6     
  Lines       51149    49481    -1668     
==========================================
+ Hits        44157    45014     +857     
+ Misses       6992     4467    -2525
Flag Coverage Δ
#multiple 88.73% <96.02%> (?)
#single 40.67% <41.8%> (?)
Impacted Files Coverage Δ
pandas/stats/moments.py 71.19% <ø> (ø) ⬆️
pandas/io/html.py 84.85% <ø> (+0.36%) ⬆️
pandas/io/sas/sasreader.py 85.18% <ø> (+77.77%) ⬆️
pandas/core/config.py 88.09% <ø> (ø) ⬆️
pandas/io/feather_format.py 86.66% <ø> (ø) ⬆️
pandas/sparse/list.py 97.1% <ø> (ø) ⬆️
pandas/indexes/numeric.py 97.1% <ø> (-0.05%) ⬇️
pandas/parser.py 100% <ø> (ø)
pandas/indexes/range.py 92.11% <ø> (+0.02%) ⬆️
pandas/compat/numpy/__init__.py 93.93% <ø> (ø) ⬆️
... and 156 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c26e5bb...5372175. Read the comment docs.

@jreback jreback added Bug Resample resample method Timedelta Timedelta data type labels Nov 15, 2016
binner = binner.insert(0, tslib.NaT)
labels = labels.insert(0, tslib.NaT)

n_NaT = sum([ax_i is tslib.NaT for ax_i in ax])
Copy link
Contributor

@jreback jreback Nov 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n_NaT = ax._isnan.sum()

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will change. Thanks

@@ -1204,6 +1206,13 @@ def _get_time_delta_bins(self, ax):
end_stamps = labels + 1
bins = ax.searchsorted(end_stamps, side='left')

if ax.hasnans:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

though we actually need to ignore the nans, doesn't this actually create a NaT group? this is inconcistent with grouping.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NaT group will be ignored in aggregation. It is handled the same way as in function _get_time_bins.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment here on what you are doing

@@ -970,6 +970,15 @@ def test_resample_timedelta_idempotency(self):
expected = series
assert_series_equal(result, expected)

def test_resample_timedelta_missing_values(self):
# GH 13223
index = pd.to_timedelta(['0s', pd.NaT, '2s'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test with an all NaT index

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

@paulgliu paulgliu closed this Nov 18, 2016
@paulgliu paulgliu reopened this Nov 18, 2016
@paulgliu paulgliu closed this Nov 18, 2016
@paulgliu paulgliu reopened this Nov 18, 2016
@@ -1189,13 +1190,18 @@ def _get_time_delta_bins(self, ax):
raise TypeError('axis must be a TimedeltaIndex, but got '
'an instance of %r' % type(ax).__name__)

if len(ax) > 0 and all(ax._isnan):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would give a better error message here. Is this consistent with how we handle all-nan grouping?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All-nan grouping doesn't seem to be handled elsewhere. Any suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we currently ignore all nan groups, that's why I think it is there, so this is consistent, maybe a comment is worth it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is all data is nan. Hmm, I would just give a better error message then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ax.hasnans

@jreback
Copy link
Contributor

jreback commented Dec 26, 2016

can you rebase.

# all NaT
index = pd.to_timedelta([pd.NaT, pd.NaT, pd.NaT])
series = pd.Series([2, 3, 5], index=index)
self.assertRaises(DataError, series.resample('1s').mean)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably the right idea (e.g. raising a nice message),

e.g. we do this for all-nan groups (which is unfriendly)

In [8]: s = Series([1,2,3],[np.nan,np.nan,np.nan])

In [9]: s.groupby(s.index).sum()
TypeError: unhashable type: 'Float64Index'

@paulgliu
Copy link
Author

paulgliu commented Feb 1, 2017

Rebased

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. some doc fixups

pls add a whatsnew entry in bug fixes (0.20.0).

ping on green.

@@ -1189,13 +1190,18 @@ def _get_time_delta_bins(self, ax):
raise TypeError('axis must be a TimedeltaIndex, but got '
'an instance of %r' % type(ax).__name__)

if len(ax) > 0 and all(ax._isnan):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ax.hasnans

@@ -1204,6 +1206,13 @@ def _get_time_delta_bins(self, ax):
end_stamps = labels + 1
bins = ax.searchsorted(end_stamps, side='left')

if ax.hasnans:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment here on what you are doing

if not len(ax):
binner = labels = TimedeltaIndex(
data=[], freq=self.freq, name=ax.name)
return binner, [], labels

start = ax[0]
end = ax[-1]
# Addresses GH #13223
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment on what you are doing here (and why not just selecting start/end)

jorisvandenbossche and others added 12 commits February 25, 2017 22:38
In `io/parsers/_try_convert_dates()` when selecting
columns based on a  column index from a set of columns with multi-
level names, the column  `name` was converted to a string.  This
appears to be a bug since the  `name` was a tuple before the
conversion.  This causes problems  downstream when there is an attempt
to use this name to lookup a  column, and that lookup fails because
the desired column is keyed from  the tuple, not its string
representation

closes pandas-dev#15376

Author: Stephen Rauch <[email protected]>

Closes pandas-dev#15378 from stephenrauch/fix_read_csv_merge_datetime and squashes the following commits:

030f5ec [Stephen Rauch] BUG: Parse two date columns broken in read_csv with multiple headers
closes pandas-dev#15426

Author: Stephen Rauch <[email protected]>

Closes pandas-dev#15433 from stephenrauch/tz-lost-in-groupby-agg and squashes the following commits:

64a84ca [Stephen Rauch] BUG: GH15426 timezone lost in groupby-agg with cython functions
in assert_frame_equal, if check_like, the former code reindex_like
before shape comparison.  for example:  if left.shape=(2,2),
right.shpae=(2.0), after reindex_like,
left.shape=(2,0),right.shape=(2,0),then the shape comparison will not
find out that the two dataframes are different. For that, the
assert_frame_equal will not raise assertion errors. But in fact it
should raise.

Author: jojomdt <[email protected]>

Closes pandas-dev#15496 from jojomdt/master and squashes the following commits:

7b3437b [jojomdt] fix test_frame_equal_message error
0340b5c [jojomdt] change check_like description
c03e0af [jojomdt] add test for TestAssertFrameEqual
470dbaa [jojomdt] combine row and column shape comparison
ce7bd74 [jojomdt] reindex_like after shape comparison
…_layout() (pandas-dev#15515)

* Add unit test for pandas-dev#9351

* Tweaks.

* add _check_plot_works; rm aux method

* Add whatsnew entry.
closes pandas-dev#15347

Author: Jeff Reback <[email protected]>

Closes pandas-dev#15484 from jreback/gbq and squashes the following commits:

0fd8d06 [Jeff Reback] wip
3222de1 [Jeff Reback] CLN: remove pandas/io/gbq.py and tests and replace with pandas-gbq
The transform() operation needs to return a like-indexed. To
facilitate this, transform starts with a copy of the original series.
Then, after the computation for each group, sets the appropriate
elements of the copied series equal to the result. At that point is
does a type comparison, and discovers that the timedelta is not cast-
able to a datetime.

closes pandas-dev#10972

Author: Jeff Reback <[email protected]>
Author: Stephen Rauch <[email protected]>

Closes pandas-dev#15430 from stephenrauch/group-by-transform-timedelta-from-datetime and squashes the following commits:

c3b0dd0 [Jeff Reback] PEP fix
2f48549 [Jeff Reback] fixup slow transforms
cc43503 [Stephen Rauch] BUG: GH15429 transform result of timedelta from datetime
…-dev#15523)

* DOC: Update contributing for test_fast, fix doc Windows build

* add pip install for xdist
jreback and others added 26 commits April 3, 2017 13:25
* DOC: update contributing.rst for ci

* typos & auto-cancel links

* make it a note

* add back accid deleted section
When cleaning `na_values` during initialization of `TextFileReader`,
we return a `list` whenever we specify that `na_values` should be
empty.  However, the rest of the code expects a `set`.

Closes pandas-dev#15835.

Author: gfyoung <[email protected]>

Closes pandas-dev#15881 from gfyoung/keep-default-na-excel and squashes the following commits:

0bb6f64 [gfyoung] BUG: Patch handling no NA values in TextFileReader
closes pandas-dev#14800

Author: Jeff Reback <[email protected]>

Closes pandas-dev#15541 from jreback/exceptions and squashes the following commits:

e5fbdc8 [Jeff Reback] give nicer deprecation / message on infer_dtype moving
ab4525b [Jeff Reback] typo on pandas.errors in whatsnew
d636ef7 [Jeff Reback] document removed exceptions
3dc4b9a [Jeff Reback] more docs for exceptions
2bb1fbd [Jeff Reback] remove AmbiguousIndexError, completely unused
5754630 [Jeff Reback] fix doc-string
35d225f [Jeff Reback] more examples
e91901d [Jeff Reback] DOC: better docs on infer_type
7e8432d [Jeff Reback] remove need for PandasError sub-class
92b2fdc [Jeff Reback] corrections
991fbb4 [Jeff Reback] API: expose pandas.errors
eec40cd [Jeff Reback] add pandas.api.lib add infer_dtype to pandas.api.lib
xref pandas-dev#12640
xref pandas-dev#14876

Author: Aleksey Bilogur <[email protected]>

Closes pandas-dev#15521 from ResidentMario/12640 and squashes the following commits:

1657246 [Aleksey Bilogur] two doc changes
28a38f2 [Aleksey Bilogur] tweak whatsnew entry.
5f306a9 [Aleksey Bilogur] +whatsnew
ff895fe [Aleksey Bilogur] Add tests, update docs.
11f3fe4 [Aleksey Bilogur] rm stray debug.
3cbbed5 [Aleksey Bilogur] Melt docstring.
d54dc2f [Aleksey Bilogur] +pd.DataFrame.melt.
… like

closes pandas-dev#15869

Author: Jeff Reback <[email protected]>

Closes pandas-dev#15892 from jreback/construct and squashes the following commits:

6bf2148 [Jeff Reback] fix perf
7fcd4e5 [Jeff Reback] BUG: Bug in DataFrame construction with nulls and datetimes in a list-like
* Citing source in README file

For GH users who strictly or heavily use the web-view instead of a local Git, having a direct link is handy, as it does not require downloading the PDF _if_ the user wanted to go to the source of it directly. It's an alternative that allows those interested in more uploads similar to this PDF from the same author(s).

* jorisvandenbossche's feedback

I re-read the PDF and made sure the wording reflected the content presented. I also changed the source-citing so that is more friendly for .TXT files instead of Markdown or unspecified.

* Update README.txt

* English enhancement

Improved sentence structure for English speakers.
xref pandas-dev#15299

Author: Jeff Reback <[email protected]>

Closes pandas-dev#15902 from jreback/series_n and squashes the following commits:

657eac8 [Jeff Reback] TST: better testing of Series.nlargest/nsmallest
1) Allows for more uniform handling of invalid file buffers to our `read_*` functions.
2) Adds a ton of new documentation to `inference.py`

Closes pandas-dev#15337.
xref pandas-dev#15895.

Author: gfyoung <[email protected]>

Closes pandas-dev#15894 from gfyoung/validate-file-like and squashes the following commits:

5a8f8da [gfyoung] DOC: Document all of inference.py
81103f7 [gfyoung] ENH: Add file buffer validation to I/O ops
 - [x] closes pandas-dev#14855    - [x] tests passed   - [x] passes ``git diff
upstream/master | flake8 --diff``

Author: alexandercbooth <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Tom Augspurger <[email protected]>

Closes pandas-dev#14871 from alexandercbooth/fix-color-scatterm-bug and squashes the following commits:

3245f09 [alexandercbooth] DOC: moving whatsnew entry to 0.20.0
8ff5f51 [alexandercbooth] BUG: addresses pandas-dev#14855 by fixing color kwarg conflict
* DOC: Fix a typo in indexing.rst

* more typos fixed
closes pandas-dev#15297

Author: Roger Thomas <[email protected]>

Closes pandas-dev#15299 from RogerThomas/fix_nsmallest_nlargest_with_n_identical_values and squashes the following commits:

d3964f8 [Roger Thomas] Fix nsmallest/nlargest With Identical Values
* CLN: clean up select_n algos

* CLN: clean ensure_data

closes pandas-dev#15903

* return ndtype, so can eliminate special cases

* unique

* fixups
@paulgliu
Copy link
Author

paulgliu commented Apr 7, 2017

rebased

@jreback
Copy link
Contributor

jreback commented Apr 7, 2017

when you rebase you should end up with a small number of commits.

try

git rebase -i yourbranchname origin/master
git push yourremote yourbranchname

@jreback
Copy link
Contributor

jreback commented May 7, 2017

closing. this needs a new PR with a cherry-pick on top of master.

@jreback jreback closed this May 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Resample resample method Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resampling with NaT in TimedeltaIndex raises MemoryError